
New pass Reduce variable liveness #3965


Open

mfrancepillois wants to merge 30 commits into main

Conversation

@mfrancepillois (Contributor) commented Apr 18, 2025:

Add a new pass to reduce variable liveness by prefetching data and then moving the load op closer to its use-op.

@mfrancepillois mfrancepillois requested review from whitneywhtsang, etiotto and a team April 18, 2025 10:52
@mfrancepillois

This comment was marked as outdated.

@mfrancepillois mfrancepillois changed the title Add pass: Reduce the register pressure New pass Reduce register pressure Apr 18, 2025
@mfrancepillois mfrancepillois linked an issue Apr 18, 2025 that may be closed by this pull request
@mfrancepillois mfrancepillois changed the title New pass Reduce register pressure [Draft] New pass Reduce register pressure Apr 18, 2025
@mfrancepillois mfrancepillois marked this pull request as draft April 18, 2025 11:40
@mfrancepillois mfrancepillois marked this pull request as ready for review April 18, 2025 16:31
@mfrancepillois mfrancepillois changed the title [Draft] New pass Reduce register pressure New pass Reduce register pressure Apr 18, 2025
@mfrancepillois mfrancepillois changed the title New pass Reduce register pressure New pass Reduce variable liveness Apr 24, 2025
@mfrancepillois mfrancepillois marked this pull request as draft April 30, 2025 16:53
Signed-off-by: Maxime France-Pillois <[email protected]>
@mfrancepillois mfrancepillois marked this pull request as ready for review April 30, 2025 17:46
@etiotto etiotto requested review from alexbaden and chengjunlu May 1, 2025 19:37
@whitneywhtsang (Contributor) left a comment:

There is a loop sink pass in IGC. Can you please create an issue for the IGC team to investigate why it doesn't catch the case of FA with the shape that gives the most gain?


/// Create a prefetch operation for the given load operation.
static void createPrefetchOp(tt::LoadOp loadOp) {
  Operation *op = loadOp.getPtr().getDefiningOp();
Contributor:

When did we check that loadOp.getPtr() is defined by an operation? Do we need to add that to isLoadCandidate?
Or should we add support for the case where the pointer is a region argument?

Contributor (Author):

Thanks for noticing. A check has been added to isLoadCandidate.
As the pass adds a prefetch right after the defining op, I'm concerned that adding this prefetch in another region (in the case where the load ptr has been defined in another region) could have side effects on the cache (as an early data fetch could mean evicting data that are still needed).

Contributor:

Do we care about the case where the pointer comes directly from a function argument?

@chengjunlu (Contributor) commented May 6, 2025:

It is good to have the reduce-variable-liveness pass as the starting point for liveness optimization in the Triton middle end.
This PR looks good to me as a beginning.

The optimization relies on the cache to hold the values that we may reuse in the loop, but the cache system is not fully controllable by the program. It would be better if we could enhance the pass to use shared local memory, making it somewhat like a RegisterToMem pass for the general case.

@etiotto (Contributor) commented May 6, 2025:

@mfrancepillois can you do a Triton Benchmark run with this PR to identify improvements (or degradations - hopefully none) in all the microbenchmarks we have?

@mfrancepillois mfrancepillois marked this pull request as draft May 12, 2025 13:11
Operation *forOp) {
// Only pointers to tensors are considered to be moved
if (!mlir::triton::isTensorPointerType(loadOp.getPtr().getType()))
if (!mlir::triton::isTensorOrTensorPointerType(loadOp.getPtr().getType()))
Contributor:

[optional]

Suggested change
if (!mlir::triton::isTensorOrTensorPointerType(loadOp.getPtr().getType()))
if (!mlir::triton::isTensorPointerType(loadOp.getResult().getType()))

Contributor:

This would limit the optimization to block pointer loads. That is conservative, and I am OK with limiting the pass in this PR. Generally speaking, the pass should work for tensors of ptrs as well as block pointers.

Contributor (Author):

The current pass does handle block pointers AND tensors of pointers (with the condition that the load has an empty mask).

@mfrancepillois (Contributor, Author) commented:

> @mfrancepillois can you do a Triton Benchmark run with this PR to identify improvements (or degradations - hopefully none) in all the microbenchmarks we have?

After a few improvements to this pass (handling multiple users for the loadOp and improving the condition for a loadOp to be elected as movable), CIs have been run: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15215096477/job/42798609994

For flash-attention, we have the following performance:

[image: flash-attention performance results]

Other benchmarks do not seem to be significantly impacted by this pass.

@mfrancepillois mfrancepillois marked this pull request as ready for review May 27, 2025 10:52
// each "for loop" given that the liveness of variables may have changed
// as a result of the code, and specifically `LoadOps`, being modified
// by the pass.
Liveness livenessAnalysis(rootOperation);
Contributor:

To reduce compile time we should detect whether the pass made any changes to the code and only rerun the analysis if changes were made.

Contributor (Author):

The code has been modified to run the analysis only when needed.
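A minimal sketch of the pattern this thread converged on (names and types are invented stand-ins; the real pass walks MLIR scf.for ops and reruns the Liveness analysis): the expensive analysis is rebuilt only when a transformation reports that it actually changed the IR.

```cpp
#include <vector>

// Stand-in for an expensive analysis such as Liveness; we only count rebuilds.
struct AnalysisStub {
  int buildCount = 0;
  void rebuild() { ++buildCount; }
};

// Stand-in for the per-loop transformation; returns true when it changed
// something (e.g. moved a LoadOp closer to its use).
static bool transformLoop(int &loop) {
  if (loop % 2 == 0) {
    ++loop;
    return true;
  }
  return false;
}

// Processes all "loops", refreshing the analysis only after a change was
// actually made, as the review requested to keep compile time down.
int processLoops(std::vector<int> loops) {
  AnalysisStub liveness;
  liveness.rebuild(); // initial analysis
  for (int &l : loops) {
    if (transformLoop(l))
      liveness.rebuild(); // refresh only when the IR changed
  }
  return liveness.buildCount;
}
```

With this shape, a run that changes nothing pays for exactly one analysis build, regardless of how many loops are visited.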

}

Operation *rootOperation = getOperation();
rootOperation->walk([&](scf::ForOp forOp) {
@etiotto (Contributor), May 28, 2025:

OK, the pass for now only handles one kind of loop (scf.for). That is OK as a first cut; we might want/need to enhance it to also support while loops in the future.

Contributor (Author):

A comment has been added to keep track of this.

@etiotto (Contributor) left a comment:

Initial round of code review comments.

#define LARGE_TENSOR_SIZE_THRESHOLD_IN_BYTES \
  LARGE_TENSOR_MAJOR_SHAPE_THRESHOLD * LARGE_TENSOR_MINOR_SHAPE_THRESHOLD * 2

static unsigned getSizeInBytes(RankedTensorType &tensorType) {
Contributor:

Add documentation for this function and the next pls.

Contributor:

[nit] static is unnecessary because these utilities are in an anonymous namespace.
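For reference, a hedged, MLIR-free sketch of what a documented getSizeInBytes helper might compute (the real function takes an MLIR RankedTensorType; the signature here is invented): the tensor's footprint is the element count times the element width in bytes.

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

namespace {
/// Returns the size in bytes of a tensor with the given shape and element
/// bit-width. Illustrative stand-in for the pass's helper, which takes a
/// RankedTensorType instead of a raw shape; `static` is omitted since the
/// anonymous namespace already gives internal linkage, per the nit above.
unsigned getSizeInBytes(const std::vector<int64_t> &shape,
                        unsigned elemBitWidth) {
  int64_t numElems = std::accumulate(shape.begin(), shape.end(), int64_t{1},
                                     std::multiplies<int64_t>());
  return static_cast<unsigned>(numElems * elemBitWidth / 8);
}
} // namespace
```

For example, a 64x256 f16 tensor occupies 64 * 256 * 2 = 32768 bytes, which matches the 32768-byte TOTAL_BLOCK_SIZE_THRESHOLD_IN_BYTES used by the pass.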

#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
#include "triton/Dialect/TritonGPU/IR/Dialect.h"
#include "triton/Dialect/TritonGPU/Transforms/Passes.h"
#include "llvm/Support/Debug.h"
Contributor:

[nit] move in the section where other llvm include headers are "included".

#include "triton/Dialect/TritonGPU/Transforms/Passes.h"
#include "llvm/Support/Debug.h"

#include "intel/include/Analysis/Liveness.h"
Contributor:

[nit] let's try to keep include headers in their sections (all Intel headers together, all Triton upstream headers together, etc...)


namespace {

#define TOTAL_BLOCK_SIZE_THRESHOLD_IN_BYTES 32768
Contributor:

Suggest to use C++ static constexpr instead of #defines.

Contributor (Author):

The code has been updated this way.
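A sketch of the constexpr style the reviewer suggested. The two shape-threshold values below are placeholders invented for illustration (the PR excerpt does not show them); only the 32768 total-block threshold appears in the source shown here.

```cpp
namespace {
// constexpr constants replace the #define thresholds; a `static` qualifier
// would be redundant inside an anonymous namespace but harmless.
// NOTE: the two shape thresholds below are illustrative placeholders, not
// the pass's actual values.
constexpr unsigned LARGE_TENSOR_MAJOR_SHAPE_THRESHOLD = 128;
constexpr unsigned LARGE_TENSOR_MINOR_SHAPE_THRESHOLD = 128;
constexpr unsigned LARGE_TENSOR_SIZE_THRESHOLD_IN_BYTES =
    LARGE_TENSOR_MAJOR_SHAPE_THRESHOLD * LARGE_TENSOR_MINOR_SHAPE_THRESHOLD * 2;
// This value does appear in the PR's source excerpt.
constexpr unsigned TOTAL_BLOCK_SIZE_THRESHOLD_IN_BYTES = 32768;
} // namespace
```

Unlike the macro version, the derived threshold is evaluated with normal operator precedence and is visible to the debugger and to static_assert checks.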

// The variable is considered as a long life span elected for being moved if:
// The live-in variables of the forOp consist of a large amount of bytes and
// The variable defined by `v` is a large tensor (with a large amount of elements
// in the minor dimension) and The variable liveness of `v` extends before
Contributor:

The -> the

return false;

for (triton::DotOp dot : dotsInFor) {
auto aVals = getLoad(dot.getA());
Contributor:

Use static types on LHS pls.

#dot1 = #ttg.dot_op<{opIdx = 1, parent = #dpas, kWidth=2}>
module attributes {ttig.support_sg_2d_block, "ttg.num-warps" = 32 : i32, "ttg.threads-per-warp" = 16 : i32} {
tt.func public @matmul_kernel_small_tensor(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}) {
// CHECK-LABEL: tt.func public @matmul_kernel_small_tensor
Contributor:

remove tt.func public here

ttig.prefetch %1 {boundaryCheck = array<i32: 0, 1>, cache = 1 : i32, evict = 1 : i32, isVolatile = false, operandSegmentSizes = array<i32: 1, 0, 0>} : !tt.ptr<tensor<64x256xf16, #dot1>>
%4:2 = scf.for %arg2 = %c0_i32 to %c64_i32 step %c64_i32 iter_args(%arg3 = %cst, %arg4 = %1) -> (tensor<16x256xf32, #dpas>, !tt.ptr<tensor<64x256xf16, #dot1>>) : i32 {
// CHECK: scf.for
// CHECK-NOT: tt.load {{.*}} : !tt.ptr<tensor<16x64xf16, #ttg.dot_op<{opIdx = 0, parent = #[[$DPAS]], kWidth = 1}>>>
Contributor:

OK, so this test checks that the load for operand A (opIdx==0) is not sunk into the loop. It would be helpful to add a COM to all the tests to briefly explain what each test is designed to cover.

Contributor (Author):

Comments have been added to describe the goal of each test.

Successfully merging this pull request may close these issues.

[FA performance] Improve the Q matrix load strategy
4 participants